Introduction

Social networks such as Facebook have become increasingly important for online news services: a growing share of readers reach news pages via links shared in these networks. Facebook users, for example, can share links to external websites - such as news portals - with their online friends, which has turned social media into a major driver of traffic to news sites. In Germany, 94% of news articles shared online in 2015 were distributed via Facebook, followed by Twitter with 3.5% and Google+ with 2.3%. The advertising-financed business model of the media houses depends on users visiting their websites, since page views drive advertising revenue. News organisations are therefore particularly interested in which topics are most likely to be shared on these platforms. Research shows that social media users favour certain sites depending on the topic: FOCUS Online, for example, is preferred for articles on politics and business, while sports news is more likely to be shared from Bild.de.

While these static sections (Ressorts) give an indication of an article's content, multiple articles in the same section need not cover the same topics (and are not equally shared), especially when the articles originate from different news portals. Furthermore, an article can contain more than one topic. We use a topic model to reveal the underlying topics of a collection of articles (a corpus) and how individual articles exhibit them. We then estimate the effect of topic prevalence on the number of Facebook shares.

Methodology

Mapping raw text to one or more topics, without prior knowledge of what those topics are, is an unsupervised classification problem on natural language. Among topic models, Latent Dirichlet Allocation (LDA) is a widely used technique in which each document (article) is viewed as a mixture of topics (represented by the document-topic distribution) and each topic is a mixture of unique terms (represented by the topic-term distribution).

This model views the text generation process as conforming to the following characteristics:

  • A topic is a mixture of words. A topic is comprised of many words, and each word maps to one or more topics.
  • A document (in this case, a news article), is a mixture of topics. Each document can be thought of as containing a proportion of words from each topic.

To “learn” the topic prevalence and the topic-term distribution, collapsed Gibbs sampling can be used. An important consideration of this model is that the number of topics \(k\) must be known a priori.
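Since \(k\) must be fixed in advance, one common heuristic is to fit models for several candidate values and compare held-out perplexity. A minimal sketch using `topicmodels` (the object name `posts.dtm` and the control settings are illustrative assumptions, not taken from the analysis):

```r
library(topicmodels)

# Sketch: compare candidate topic counts by perplexity
# (assumes a DocumentTermMatrix `posts.dtm`; settings are illustrative)
candidate_k <- c(20, 30, 40, 50)
perplexities <- sapply(candidate_k, function(k) {
  fit <- LDA(posts.dtm, k = k, method = "Gibbs",
             control = list(seed = 1234, burnin = 500, iter = 1000))
  perplexity(fit, posts.dtm)
})
```

Lower perplexity indicates a better fit, though interpretability of the resulting topics should also weigh into the choice of \(k\).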

Preliminaries

We load libraries needed for this analysis.

suppressPackageStartupMessages({
  library(dplyr)       # Data manipulation
  library(stringr)     # String manipulation
  library(lubridate)   # Date and time manipulation
  library(purrr)       # Functional programming
  library(tidyr)       # Reshaping
  library(magrittr)    # Advanced piping
  library(pushoverr)   # Pushover notifications
  library(doMC)        # Parallel Computing
  library(readr)       # Importing data
  library(tibble)      # Better data frames
  
  library(ggplot2)     # Static data visualization
  library(ggrepel)     # Repel text labels
  library(ggiraph)     # GGplot interactive
  library(scales)      # Scales
  library(viridis)     # Viridis color scales
  library(htmlwidgets) # JS visualizations
  library(htmltools)   # Arbitrary html
  library(ggjoy)       # Create joyplots
  library(gganimate)   # Animating ggplots
  library(tweenr)      # Tweening charts
  
  library(httr)        # HTTP functions
  library(jsonlite)    # JSON parsing
  
  library(tidytext)    # Tidy text mining
  library(tm)          # Text mining framework
  library(hunspell)    # Text processing
  library(stringdist)  # String distances
  library(topicmodels) # Topic modelling
  library(proxy)       # Distance measures
  library(SnowballC)   # Stemming
})

# Theming
quartzFonts(
  Roboto = 
    c("Roboto-Light",
      "Roboto-Bold",
      "Roboto-Regular",
      "Roboto-Thin")
)

theme_set(
  theme_bw(base_family = "Roboto", base_size = 14) +
    theme(
      plot.title = element_text(face = "bold", size = 14, 
                                margin = margin(0, 0, 4, 0, "pt")),
      plot.subtitle = element_text(size = 12),
      plot.caption = element_text(size = 6, hjust = 0),
      axis.title = element_text(size = 10),
      panel.border = element_blank()
    )
)

rm(list=ls())
# Functions
source("func/02-topic-modelling-functions.R")

Pre-processing

Shaping

btw %>% 
  # Tokenize by word
  unnest_tokens(word, text_cleaned, token = "words") ->
  # Assign to variable
  posts_tokenized.dt

We remove words that are less than 3 characters long.

# Remove short words
posts_tokenized.dt %>% 
  filter(str_length(word) >= 3) ->
  posts_tokenized.dt

Filtering and Casting

We summarise the tokens into per-article term frequencies, which we will then cast into a document-term matrix.

# Summarise
posts_tokenized.dt %>% 
  group_by(articleID, word) %>%
  dplyr::summarise(term_frequency = n()) %>%
  ungroup() ->
  posts_tokenized.dt
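The cast into a document-term matrix can then be done with `tidytext::cast_dtm`; a minimal sketch following the pipeline above (the output name `posts.dtm` is an assumption):

```r
# Cast the (articleID, word, term_frequency) tibble into a DocumentTermMatrix
posts_tokenized.dt %>% 
  cast_dtm(articleID, word, term_frequency) ->
  posts.dtm
```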

Analysis

Topic modelling

LDA Training

Build Corpus
Run LDA using Gibbs sampling

We then train the LDA on the full dataset, with \(k = 40\).
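The training call itself might look like the following sketch with `topicmodels` (the document-term matrix name `posts.dtm` and the control parameters are illustrative assumptions):

```r
library(topicmodels)

# Fit LDA via collapsed Gibbs sampling with k = 40 topics
# (assumes `posts.dtm` is a DocumentTermMatrix; settings are illustrative)
ldaOut <- LDA(
  posts.dtm, k = 40, method = "Gibbs",
  control = list(seed = 1234, burnin = 1000, iter = 2000, keep = 50)
)
```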

Training took 50 minutes on the full dataset. We extract the word-topic probabilities, and the document-topic probabilities produced by the model.


Topic Labeling

# Word-topic probabilities
ldaOut %>% tidy("beta") -> posts.wtp

# Document-topic probabilities
ldaOut %>% tidy("gamma") -> posts.dtp

We can then assess the word-topic probabilities in order to get an idea of what each topic is about.

Since we have over 30,000 unique terms in the corpus, we need to extract the top few words that most uniquely define each topic, so that we can more easily visualize and label them. We use the measure of relevance defined by Sievert and Shirley (2014). The relevance \(r(w, k | \lambda)\) of term \(w\) to topic \(k\), given a weight parameter \(\lambda\) between 0 and 1, is computed as:

\[ r(w, k | \lambda) = \lambda\log(\phi_{kw}) + (1-\lambda)\log(\frac{\phi_{kw}}{p_w}) \]

where \(\phi_{kw}\) is the probability of term \(w\) for topic \(k\), and \(p_w\) is the empirical probability of the word in the corpus. \(\lambda\) can be thought of as a weighting term between ranking by the probability of that word within the topic, and ranking by the lift over the overall probability in the corpus.

In user studies, Sievert and Shirley (2014) found a \(\lambda\) value of 0.6 to be optimal for allowing humans to identify the topics associated with the top words ranked by relevance. We use this same value for lambda.

posts.wtp %>%
  # Compute lambda and phi_kw
  mutate(lambda = 0.6, phi_kw = beta) %>% 
  # Compute and join the p_w
  left_join(
    posts_tokenized.dt %>%
      group_by(word) %>% 
      dplyr::summarise(frequency = sum(term_frequency)) %>% 
      ungroup() %>% 
      mutate(p_w = frequency/sum(frequency)) %>% 
      select(-frequency),
    by = c("term" = "word")
  ) %>%
  # Compute the relevance
  mutate(relevance = lambda * log(phi_kw) + (1 - lambda) * log(phi_kw/p_w)) ->
  word_relevance.dt

We inspect sample articles assigned to each topic, together with the topic's most relevant words, and label each topic according to its prevalent theme. These labels are recorded in a csv file and loaded back in:
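The loading step might look like the following sketch (the file path and object name are hypothetical, since the actual file is not shown in the source):

```r
# Load the manually recorded topic labels
# (file path is hypothetical; the actual file is not shown)
topic_labels.dt <- read_csv("data/topic-labels.csv")
```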

### Most likely Topics per Article

# The topics function from the package is used to extract the most likely topic for each document 
btw.topics <- topics(ldaOut, 2)

# Create Dataframe
doctopics.df <- as.data.frame(t(btw.topics))

doctopics.df %>%
  transmute(title = rownames(.),
            topic = V1) -> doctopics.df

doctopics.df$articleID <- as.integer(rownames(doctopics.df))

# Add topic to original data frame
topics_mapping.dt <- btw %>%
  mutate(articleID = as.integer(articleID)) %>%
  inner_join(.,doctopics.df, by="articleID") %>%
  
  group_by(topic) %>%
  slice(1) %>%
  select(topic, title.text, articleID) 

Create Topic Labels

## Return the top 30 terms.
btw.terms <- as.data.frame(terms(ldaOut, 30), stringsAsFactors = FALSE)


topicTerms <- btw.terms %>% gather(Topic)
topicTerms <- cbind(topicTerms, Rank = rep(1:30))
topicTerms <- topicTerms %>% filter(Rank < 5)

topicTerms <- topicTerms %>% mutate(topic = stringr::word(Topic, 2))
topicTerms$topic <- as.numeric(topicTerms$topic)

topicLabel <- data.frame()
for (i in 1:40){
  z <- dplyr::filter(topicTerms, topic==i)
  l <- as.data.frame(paste(z[1,2], z[2,2], z[3,2], z[4,2], sep = " "), stringsAsFactors = FALSE)
  topicLabel <- rbind(topicLabel, l)
}
colnames(topicLabel) <- c("topic_name")
topicLabel$topic <- as.integer(rownames(topicLabel))
## Combine with Topic label
topics_mapping.dt %>%
  left_join(., topicLabel, by="topic") -> topics_mapping.dt

We extract the top 30 most relevant words and plot them as follows:

  1. Compute the Jensen-Shannon divergence between each pair of topics according to the differences in their word-topic probabilities and construct a distance matrix,
  2. Apply principal coordinates analysis to project the distance matrix down to two dimensions, and
  3. Plot each word near the center of its topic's mass and repel the labels from each other.
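The first two steps can be sketched as follows (assuming a topic-term probability matrix `phi` with one row per topic, e.g. derived from `posts.wtp`; the object names are illustrative):

```r
# Sketch: topic-topic distances via Jensen-Shannon divergence, then PCoA
# (assumes word-topic tibble `posts.wtp` with columns topic, term, beta)
phi <- posts.wtp %>% 
  spread(term, beta) %>% 
  select(-topic) %>% 
  as.matrix()

# Jensen-Shannon divergence between two probability vectors
jsd <- function(p, q) {
  m <- (p + q) / 2
  0.5 * sum(ifelse(p > 0, p * log(p / m), 0)) +
    0.5 * sum(ifelse(q > 0, q * log(q / m), 0))
}

topic_dist   <- proxy::dist(phi, method = jsd)  # pairwise distance matrix
topic_coords <- cmdscale(topic_dist, k = 2)     # project to two dimensions
```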

Facebook News Map

Now that we have labelled each topic, we sample documents classified to each topic for reference. We take a random sample of 10 posts from the 300 highest-probability fits per topic.

set.seed(9272)
posts.dtp %>% 
  group_by(title = document) %>% 
  summarise(topic = min(topic[gamma == max(gamma)]), gamma = max(gamma)) %>% 
  ungroup() %>% 
  inner_join(btw %>% select(title, articleID), by = "title") %>% 
  inner_join(topics_mapping.dt, by = "topic") %>% 
  mutate(topic_title = 
           paste0("Topic ", formatC(topic, flag = "0", width = 2), 
                  " - ", topic_name)) %>% 
  group_by(topic_name) %>% 
  top_n(300, gamma) %>%
  sample_n(10) %>% 
  mutate(row = row_number()) %>% 
  ungroup() ->
  posts_classification.sdt

We produce a chart that shows this in a presentable manner.

Facebook news samples

Topic Distribution by News Page

Distribution

A common accusation leveled against traditional news media is bias in how heavily various topics are reported. We explore the topic distribution of articles across news pages.

set.seed(7292)
posts_classification.dt %>% 
  inner_join(
    btw %>% select(document = title, news_page = site, title),
    by = "document"
  ) %>%
  inner_join(topics_mapping.dt, by = "topic") %>% 
  group_by(news_page, topic_name) %>%
  summarise(
    articles = sum(allocation),
    sample_articles = 
      paste0(
        "<li>", sample(document, pmin(3, length(document))), "</li>",
        collapse = "<br>"
      )
  ) %>%
  ungroup() %>% 
  group_by(news_page) %>% 
  mutate(perc_articles = articles/sum(articles)) %>% 
  ungroup() %>%
  # left_join(
  #   top_100_fbpages.dt %>% select(news_page = page_name, page_title), 
  #   by = "news_page"
  # ) %>% 
  mutate(
    tooltip_text = 
      paste0(
        "<b>", comma(round(articles, 0)), " (", percent(perc_articles), 
        ") </b> of articles from <b>", news_page, 
        "</b> fall under <b>", topic_name, "</b><br><br>Sample Articles:<br><ul>",
        sample_articles, "</ul>"
      ) %>% str_replace_all("'", "")
  ) ->
  news_page_distribution.dt
{
  news_page_distribution.dt %>% 
    ggplot(aes(x = news_page, y = perc_articles, fill = topic_name)) +
    geom_bar_interactive(
      aes(tooltip = tooltip_text),
      position = position_stack(), 
      stat = "identity", color = "darkgray"
    ) +
    scale_y_continuous(
      name = "Percent of Articles", 
      labels = percent,
      expand = c(0, 0)
    ) +
    theme(
      legend.position = "none",
      axis.title.y = element_blank(),
      legend.background = element_rect("#fafafa"),
      plot.background   = element_rect("#fafafa", "#fafafa"),
      plot.subtitle     = element_text(size = 8),
      plot.caption      = element_text(size = 7),
      plot.title        = element_text(size = 18)
    ) +
    coord_flip() +
    scale_fill_viridis(discrete = TRUE) +
    labs(
      title = "BIAS AND BALANCE",
      subtitle = "Topic distribution of online news articles per news page, 2017\n(Hover over each bar to see the topic and sample articles)",
      caption = "
DATA SOURCE: Facebook Graph API

CHART NOTES:
  1. The news pages are arranged in descending order in terms of number of articles
  2. Topics were derived through Latent Dirichlet Allocation (LDA) with 40 topics
  3. Articles were classified according to the topic that has the highest probability
      "
    )
} %>% 
  ggiraph(
    ggobj = .,
    width_svg = 12,
    height_svg = 8,
    width = 1,
    tooltip_extra_css = "
    font-family: Roboto; 
    background-color: #000; 
    color: #fff; font-size: 10px;
    padding: 5px;"
  ) %>% 
  prependContent(
    tags$link(
      href = "https://fonts.googleapis.com/css?family=Roboto:400,700", 
      rel = "stylesheet"
    )
  ) -> chart_topic_distribution.wdgt

saveWidget(
  widget = chart_topic_distribution.wdgt,
  file = "../figs/04-topic-distribution-pages.html",
  selfcontained = FALSE,
  libdir = "../figs/js",
  background = "#fafafa"
)

chart_topic_distribution.wdgt

Topic Concentration

A common criticism leveled against news media sites is that they place undue emphasis on particular topics. One way to measure concentration across topics is the Herfindahl-Hirschman Index (HHI), which is simply \(\sum_{i=1}^{n} s_i^2\), where \(s_i\) is the share of topic \(i\) among the \(n\) topics.
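As a quick illustration of how the index behaves, a page that splits its coverage evenly across topics scores lower than one dominated by a single topic (the helper below is illustrative; the index is scaled by 100 as in the chart):

```r
# HHI on topic shares (shares sum to 1; scaled by 100 for readability)
hhi <- function(shares) sum(shares^2) * 100

hhi(c(0.25, 0.25, 0.25, 0.25))  # even coverage across 4 topics -> 25
hhi(c(0.85, 0.05, 0.05, 0.05))  # one dominant topic -> 73
```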

{
  news_page_distribution.dt %>% 
    group_by(news_page) %>% 
    arrange(desc(articles)) %>% 
    summarise(
      hh_index  = sum(perc_articles ^ 2),
      t3_topics = paste0(topic_name %>% head(3), collapse = "<br>")
    ) %>% 
    ungroup() %>% 
    arrange(hh_index) %>% 
    mutate(news_page = factor(news_page, unique(news_page))) %>% 
    ggplot(aes(x = news_page, y = hh_index * 100)) +
    geom_bar_interactive(
      aes(
        tooltip = 
          paste0("<b>Top Topics:</b><br>", t3_topics) %>% 
          str_replace_all("'", "")
      ),
      stat = "identity", 
      fill = c("#008B45")
    ) +
    geom_text(aes(label = comma(round(hh_index * 100, 1))), hjust = 1.5, 
              color = "white", family = "Roboto") +
    scale_y_log10(
      name = "Herfindahl-Hirschman Index (HHI)\n(100 = most concentrated)",
      expand = c(0, 0)
    ) +
    labs(
      title = "TOPIC CONCENTRATION OF PHILIPPINE NEWS PAGES",
      subtitle = "Herfindahl-Hirschman Index (HHI)",
      caption = "
DATA SOURCE: Facebook Graph API

CHART NOTES:
  1. Hover over each bar to see information and the top topics
  2. Topics were derived through Latent Dirichlet Allocation (LDA) with 40 topics
  3. Articles were classified according to the topic that has the highest probability

TROY JAMES PALANCA | TJPALANCA.COM
      "
    ) +
    coord_flip() +
    theme(
      axis.title.y       = element_blank(),
      panel.grid.major.y = element_blank(),
      plot.background   = element_rect("#fafafa", "#fafafa"),
      plot.subtitle      = element_text(size = 10),
      plot.caption       = element_text(size = 8),
      plot.title         = element_text(size = 16)
    )
} %>% 
  ggiraph(
    ggobj = .,
    width_svg = 10,
    height_svg = 8,
    width = 1,
    tooltip_extra_css = "
    font-family: Roboto; 
    background-color: #000; 
    color: #fff; font-size: 16px;
    padding: 5px;"
  ) %>% 
  prependContent(
    tags$link(
      href = "https://fonts.googleapis.com/css?family=Roboto:400,700", 
      rel = "stylesheet"
    )
  ) -> chart_topic_concentration.wdgt

saveWidget(
  widget = chart_topic_concentration.wdgt,
  file = "../figs/05-topic-concentration.html",
  selfcontained = FALSE,
  libdir = "../figs/js",
  background = "#fafafa"
)

chart_topic_concentration.wdgt